ABSTRACT

This study was designed to: (1) derive a set of
weights for student-faculty rating item scores that maximize
differences in mean rating scores for groups of students known to
have experienced real differences in instructional effectiveness; and
(2) attempt to cross-validate the new scoring methods in a second
study of students who rated lectures under the same conditions.
Participants in the first study were 207 undergraduate and graduate
students who were enrolled in general studies sections, while
participants in the second study were 213 students enrolled in 12
sections of an undergraduate psychology course. The four types of
faculty lectures were: (1) high information-high enthusiasm; (2) high
information-low enthusiasm; (3) low information-high enthusiasm; (4)
low information-low enthusiasm. In the first study, sections of the
same class were randomly divided, while in the second study intact
sections of the same class constituted the study group. Lecturers
were randomly assigned to student groups. Students in all groups
viewed one lecture presentation, rated the presentation using an
18-item questionnaire, and were tested on the material. The use of
discriminant analysis to develop student-rating scales that are
valid with respect to faculty enthusiasm is supported by this study.
This study also indicated that empirically based selection and
weighting of items does not improve the validity of student rating
scores in detecting real differences in information giving. Current
practices in the evaluation of teaching effectiveness are limited
almost entirely to proxy measures, namely, student ratings. (SK)




DISCRIMINANT ANALYSIS AND CLASSIFICATION OF TEACHING
EFFECTIVENESS USING STUDENT RATINGS: THE SEARCH FOR DOCTOR FOX



John E. Ware, Jr. and Reed G. Williams 



March 1976 






P-5600



The Rand Paper Series 



Papers are issued by The Rand Corporation as a service to its professional staff.
Their purpose is to facilitate the exchange of ideas among those who share the
author's research interests. Papers are not reports prepared in fulfillment of
Rand's contracts or grants. Views expressed in a Paper are the author's own, and
are not necessarily shared by Rand or its research sponsors.



The Rand Corporation 

Santa Monica, California 90406 




ACKNOWLEDGMENTS

Data gathering and analysis was supported by the Office of Research
and Projects, Southern Illinois University at Carbondale. Students were
scheduled with the cooperation of Larry Bush and John Mouw, Southern
Illinois University at Carbondale. The preparation of this paper was
supported by The Rand Corporation, Santa Monica, California, 90405.

A summary of this paper was presented at the Fourteenth Annual
Conference on Research in Medical Education, November 1975, during the
86th Annual Meeting of the Association of American Medical Colleges,
Washington, D.C.






DISCRIMINANT ANALYSIS AND CLASSIFICATION OF TEACHING
EFFECTIVENESS USING STUDENT RATINGS: THE SEARCH FOR DOCTOR FOX



Inquiries into the validity of student-faculty ratings and the
effects of differences in instructors on students are rarely conducted
using experimental methods. Only six reports of such studies could be
found. Mastin (1963) reported that high school students who heard
lectures by "enthusiastic" teachers learned more and had more favorable
course attitudes than students of less enthusiastic teachers. Coats and
Smidchens (1966) showed that "dynamic" lectures result in higher student
achievement than less dynamic lectures. Zelby (1974) demonstrated that
faculty can teach so as to obtain more favorable student ratings for
particular instructor and course characteristics. Powell (1975) showed
that students who were required to do less work and who learned less
rated their faculty more favorably than students who were required to
do more work and who learned more.

The two "Doctor Fox" studies of lecture presentations (Ware and
Williams, 1975; Williams and Ware, 1976) have shown that: a) an
enthusiastic presentation manner results in greater student learning
when initial motivation to learn is low, b) differences in
information-giving produce corresponding differences in student
learning levels, c) student-faculty ratings are valid in relation to
information-giving and group learning when presentations are not given
in an enthusiastic manner, d) the latter is not true when faculty
presentations are given enthusiastically. In other words, ratings of
enthusiastic presentations containing a lot of information do not
differ from ratings of enthusiastic presentations containing little or
no information even though students who viewed high information
presentations learned more. This phenomenon, which has been termed
"The Doctor Fox Effect," suggests that student ratings as commonly
scored primarily reflect faculty enthusiasm (Ware and Williams, 1975).

Throughout the Doctor Fox experiments and most correlational
studies of the validity of student ratings, simple methods of computing
student-faculty rating scores have been employed, namely, analysis of
single-item scores or the simple algebraic sum of item scores. Such
practices lag behind the scoring systems that could be developed using
multivariate statistical techniques and experimental data. Easy-to-use
computer programs required to perform multivariate analyses of student
ratings (factor analysis, regression analysis, discriminant analysis,
etc.) are readily available; however, the required experimental data
has only recently become available and only to a limited degree.

Experiments are needed in order to precisely control faculty
differences and to control for differences in student groups due to
self-selection of faculty types. Precise control of differences in
faculty characteristics is one way of establishing criteria of
instructional effectiveness against which to develop and test more
valid student-faculty rating scoring methods. Given such data, it is
possible to develop and validate a student rating scoring algorithm
which will maximally detect known differences in faculty
characteristics under controlled conditions. These controls were
achieved during the "Doctor Fox" studies by using a carefully
programmed Hollywood actor who portrayed a variety of faculty types
with equivalent groups of students (Ware and Williams, 1975; Williams
and Ware, 1976).

The analyses presented in this paper were designed to take
advantage of the data gathered during the "Doctor Fox" studies in order
to determine the extent to which more sophisticated student-faculty
rating scoring methods would improve the validity of rating scores in
relation to differences in instructional effectiveness. Specifically,
the current studies were designed to: a) derive a set of weights for
student-faculty rating item scores that maximize differences in mean
rating scores for groups of students known to have experienced real
differences in instructional effectiveness, and b) attempt to
cross-validate the new scoring methods in a second study of students
who rated lectures under the same conditions.
Method

Participants 

Participants in the first study were 207 undergraduate and graduate
students who were enrolled in general studies sections. Thirty-three
percent were males. They ranged in age from 17-42 years with a
median age of approximately 20 years. Twenty-one percent were
freshmen, 30 percent were sophomores, 28 percent were juniors, 18
percent were seniors and 3 percent were graduate students. Forty-two
percent reported they were in liberal arts, biological and physical
sciences. Other academic majors included education, engineering and
technology, business, and home economics. The analysis sample selected
from the first study included the 115 students who experienced lectures
high or low in information-giving (Ware and Williams, 1975).

Participants in the second study were 213 students enrolled in 12
sections of an undergraduate psychology course. Fifty-eight percent
were males. They ranged from 18 to 38 years of age with a median of
approximately 20 years. Seven percent were freshmen, 47 percent
sophomores, 37 percent juniors, 7 percent seniors and 2 percent were
graduate students. Forty-seven percent reported they were education
students, 25 percent liberal arts and sciences, 18 percent home
economics, agriculture, and business. The second study analysis sample
consisted of the 70 students who experienced lectures high or low in
information-giving and who were not given an incentive to learn
(Williams and Ware, 1976).

Faculty Characteristics. Two faculty characteristics
(information-giving and presentation manner) that are frequently cited
as operational definitions of teaching effectiveness were manipulated.
The specific definitions and controls used are documented elsewhere and
only a brief description will be repeated (Ware and Williams, 1975;
Williams and Ware, 1976). High and low amounts of information-giving
were achieved vis-a-vis strict adherence to verbatim lecture scripts
during videotaped lecture presentations over the same topic (the
biochemistry of learning). Teaching points were eliminated through a
modified random procedure so that the lecture high in information
covered 26 teaching points and the lecture low in information covered
only four teaching points.

Presentation manner was manipulated by programming the actor to
give each of the two lectures (high and low information-giving) in
either an enthusiastic or unenthusiastic manner. Levels of enthusiasm
were associated with differences in expressiveness, vocal inflection,
friendliness, charisma, humor, and "personality." The two faculty
characteristics chosen for study (enthusiasm and information-giving)
have been identified in previous correlational and experimental studies
of faculty characteristics as associated with effective teaching (Coats
and Smidchens, 1966; Rosenshine and Furst, 1971; Coffman, 1954;
Isaacson, et al., 1964; Solomon, 1966).

The result was four faculty lecturer types, for purposes of the 
current study, as follows: (1) high information-high enthusiasm, (2) 
high information-low enthusiasm, (3) low information-high enthusiasm, 
and (4) low information-low enthusiasm. 

Study Design. In the first study, sections of the same class were
divided randomly. In the second study, intact sections of the same
class constituted the study groups. In both studies, lecturer types
were randomly assigned to student groups and the groups were shown to
be equivalent in terms of age, sex, GPA, and a priori interest in the
lecture topic. Students in all groups: a) viewed one lecture
presentation, b) rated the presentation using an 18-item questionnaire
like those in general use (Pohlmann, 1975), and c) were tested over the
material (test score based on 26 multiple-choice questions). The 18
rating items, which were scored using a five-choice response continuum,
are listed in Figure 1.

Analysis Plan. Discriminant Analysis (Tatsuoka, 1971; Huberty,
1975) and student ratings were used to solve for the linear discriminant
functions (LDF's) that maximize differences among means for groups of
students in the first study. In other words, the following question
was asked: How should student-faculty ratings be scored in order to
reduce the amount of overlap in mean rating scores of students
experiencing known differences in teaching effectiveness
(information-giving and presentation manner)? Eighteen rating variables
were used in a stepwise manner in solving for functions discriminating
among the four groups in the first study. LDF's associated with chance
probabilities of .05 or less were considered and rating items
associated with chance probabilities of .05 or less were used to score
retained LDF's. Students in the first study were classified on the
basis of two LDF's derived in the first study with Bayesian adjustment
of probabilities of group membership. The resulting classifications
(predicted lecturer type) were compared with known classifications
(actual lecturer type). The independence of known and predicted
classifications was tested using Chi-Square analysis (Siegel, 1956).
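
The classification step described above can be illustrated with a minimal Python sketch. It is not the program used in the study. It assumes that each student's two LDF scores are expressed in canonical discriminant space (where the pooled within-group covariance is the identity, so squared Euclidean distance to a group centroid serves as the distance measure), and it treats "Bayesian adjustment" as weighting each group by a prior probability of membership. The centroid values, the prior values, and the function names are hypothetical.

```python
import math


def posterior_probabilities(scores, centroids, priors):
    """Posterior probability of each group given one student's LDF scores.

    Assumes scores lie in canonical discriminant space, so the squared
    Euclidean distance to a centroid stands in for the Mahalanobis distance.
    """
    weighted = {}
    for group, centroid in centroids.items():
        d2 = sum((s - c) ** 2 for s, c in zip(scores, centroid))
        weighted[group] = priors[group] * math.exp(-0.5 * d2)
    total = sum(weighted.values())
    return {group: w / total for group, w in weighted.items()}


def classify(scores, centroids, priors):
    """Assign the student to the group with the highest posterior probability."""
    post = posterior_probabilities(scores, centroids, priors)
    return max(post, key=post.get)


# Hypothetical centroids on (LDF I, LDF II); these are NOT the Table 4 values.
CENTROIDS = {
    "high info-high enthusiasm": (1.0, 0.5),
    "high info-low enthusiasm": (-1.0, 0.5),
    "low info-high enthusiasm": (1.0, -0.5),
    "low info-low enthusiasm": (-1.0, -0.5),
}

# Equal priors: every lecturer type is treated as equally likely beforehand.
EQUAL_PRIORS = {group: 0.25 for group in CENTROIDS}

# Hypothetical unequal priors, e.g. reflecting unequal group sizes.
UNEQUAL_PRIORS = {
    "high info-high enthusiasm": 0.1,
    "high info-low enthusiasm": 0.5,
    "low info-high enthusiasm": 0.2,
    "low info-low enthusiasm": 0.2,
}

predicted_equal = classify((0.8, 0.4), CENTROIDS, EQUAL_PRIORS)
predicted_adjusted = classify((0.0, 0.5), CENTROIDS, UNEQUAL_PRIORS)
```

With equal priors the rule reduces to assigning each student to the nearest centroid; unequal priors shift borderline cases toward the groups considered more probable beforehand.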

The second phase of the analysis plan consisted of using the LDF's
derived in the first study to classify independent groups of students
who experienced and rated the same four lectures during the second
study. Students in the second study were classified on the basis of
the LDF's without Bayesian adjustment of probabilities of group
membership. Predicted and known classifications in the second study
were compared using Chi-Square analysis.

If the rating scores defined by the LDF's are valid, they should
result in correct classification of lecturer types a significant
proportion of the time. A chance probability of .05 or less (two-tailed
test) was established for Type I errors in testing this hypothesis.

Results 

Two significant LDF's accounting for 73 and 19 percent of the
variance, respectively, were derived in the first study. Six rating
variables were associated with significant coefficients. Standardized
coefficients associated with significant functions are presented in
Table 1. Each significant function was interpreted by considering the
rating item associated with the highest coefficient and items associated
with coefficients equal to or greater than half that amount (Tatsuoka,
1971). In the case of the first LDF, the results were straightforward,
i.e., there was one important coefficient for the rating item
pertaining to faculty "enthusiasm."

High positive coefficients for the second LDF were observed for
rating items pertaining to "spoke understandably," "broadened my
interest in the subject," and "stressed important material." A high
negative coefficient on the second LDF was observed for the rating item
pertaining to "gave examples to explain." Thus, a high positive score
on the second LDF appears to indicate understandability or clarity of
the subject matter or material presented. Given that unrelated
examples and details of studies without results were used as filler
material in the low information lectures, it is not surprising that a
high score on the second LDF was associated with a low rating for
"gave examples."
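
As an illustration of how the two functions might be scored for a single student, the following Python sketch applies the standardized weights from Table 1 as a weighted sum. It assumes that standardized weights are applied to standardized (z-score) item ratings, which is the usual convention but is not spelled out in the text. The student's ratings and the function name are hypothetical.

```python
# Standardized weights from Table 1, keyed by rating item number.
LDF_WEIGHTS = {
    "LDF I": {1: 0.040, 6: 0.076, 8: 0.061, 11: -0.141, 13: -0.147, 17: 0.533},
    "LDF II": {1: 0.336, 6: -0.295, 8: 0.151, 11: -0.018, 13: 0.222, 17: -0.147},
}


def ldf_score(z_scores, weights):
    """Weighted sum of standardized item ratings for one student."""
    return sum(weight * z_scores[item] for item, weight in weights.items())


# Hypothetical standardized ratings for one student (not data from the study).
student = {1: 0.5, 6: -1.0, 8: 0.0, 11: 0.2, 13: 0.3, 17: 1.5}

scores = {name: ldf_score(student, weights) for name, weights in LDF_WEIGHTS.items()}
```

For this hypothetical student the first score is dominated by item 17 (enthusiasm), while the low rating on item 6 (gave examples) raises the second score because that weight is negative.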

Classifications of lecturer types in the first study, using LDF's
derived from the same data, are shown in Table 2. Classifications were
correct for both faculty characteristics for 77 of 115 students
(approximately 67 percent). The classifications of lecturers in terms
of enthusiasm (high versus low) were correct for 99 of 115 students
(approximately 86 percent). Information-giving differences (high versus
low) were correctly classified for 89 of 115 students (approximately
77 percent).
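
The hit rates reported for the first study follow directly from the counts given above. A brief Python check, using only those counts (the function and variable names are illustrative):

```python
def hit_rate(correct, total):
    """Percentage of students whose lecturer type was classified correctly."""
    return 100.0 * correct / total


# Counts of correct classifications reported for the 115 first-study students.
FIRST_STUDY_CORRECT = {
    "both characteristics": 77,
    "enthusiasm": 99,
    "information-giving": 89,
}

rates = {
    label: round(hit_rate(count, 115))
    for label, count in FIRST_STUDY_CORRECT.items()
}
```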

However, the ultimate goal of the current research was to determine
the extent to which LDF's that appear to be valid in one study are
generalizable to groups of students other than those for whom the
function weights were derived. The results of classifications of
lecturer types in the second study using the two LDF's derived in the
first study are shown in Table 3. Classifications by 26 of the 69
students were correct for both faculty characteristics (approximately a
39 percent hit rate). Although not very impressive, these results
indicate that actual and predicted classifications are not independent
when marginal totals are considered in computing expected frequencies
(χ² = 8.2, df = 1, p < .01, corrected for continuity).

Differences in the validity of the two LDF's are apparent in Table
3 for the two faculty characteristics manipulated in the second study.
Classifications of lecturers in terms of enthusiasm were correct for
53 of 69 students (approximately 77 percent). This relationship
represents a validity coefficient of .55 (p < .001) when expressed as
phi and a Chi-Square value of 21.07 (df = 1, p < .001). On the other
hand, classifications in terms of information-giving were correct for
34 of 69 students (approximately 49 percent). This degree of
association is represented by a phi coefficient of .07 and a Chi-Square
of .30 (df = 1), neither of which is significant.
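
The phi coefficients reported above are consistent with the standard relation phi = sqrt(Chi-Square / N) for a 2 x 2 table. The Python sketch below checks that relation against the reported values and adds a general 2 x 2 Chi-Square function with an optional continuity correction. The cell counts passed to that function are hypothetical, since Table 3 is not reproduced here, and the function names are illustrative.

```python
import math


def phi_from_chi_square(chi_square, n):
    """Phi coefficient for a 2 x 2 table: sqrt(chi-square / N)."""
    return math.sqrt(chi_square / n)


def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for the 2 x 2 table [[a, b], [c, d]].

    With yates=True, Yates' continuity correction is applied.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(diff - n / 2.0, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Values reported in the text for the second study (N = 69).
phi_enthusiasm = phi_from_chi_square(21.07, 69)
phi_information = phi_from_chi_square(0.30, 69)

# Hypothetical counts of actual (rows) by predicted (columns) enthusiasm level.
uncorrected = chi_square_2x2(25, 10, 6, 28)
corrected = chi_square_2x2(25, 10, 6, 28, yates=True)
```

Rounded to two decimals, the two phi values agree with the reported .55 and .07. The continuity correction always lowers the Chi-Square value, which makes the test somewhat more conservative for small samples.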

Some insight into the nature of student errors in classifying
faculty according to information-giving can be gained from further
analysis of the data presented in Table 3. First, it is helpful to note
that 53 students in the second study classified the lecturer they
experienced as high in information-giving, whereas only 30 actually saw
a high information lecture. Fourteen of the 23 errors in classification
(i.e., approximately 61 percent of the errors) were made by students
who actually saw a lecture that was low in information-giving (only four
teaching points covered) and that was presented in an enthusiastic
manner. Thus, the majority of errors in detecting differences in
information-giving appear to be the result of bias in student ratings
due to differences in faculty enthusiasm. This is an example of what
has been termed the "Doctor Fox Effect" (Ware and Williams, 1975).

Finally, some insight into the validity of the LDF's on a group
basis can be gained by way of an analysis of group centroids. Group
centroids are presented in Table 4 and are plotted in Figure 2 for all
eight groups.

The two LDF's clearly differentiate faculty presentations with
respect to information-giving and enthusiasm in the first study. The
centroids for groups of students who experienced high information
lectures are high in relation to the second LDF (subject matter).
Likewise, centroids for groups of students who experienced enthusiastic
lecture presentations in the first study are high in relation to the
first LDF (enthusiasm) and groups of students who experienced lectures
delivered so as to be low in enthusiasm in the first study are low in
relation to the first LDF.

The same trends are apparent for student groups in the second
study with respect to the first LDF but not the second LDF (subject
matter). Centroids for groups of students who experienced enthusiastic
lecture presentations in the second study are high on the first LDF as
they should be. However, these groups are not accurately differentiated
on the second LDF. The centroids for groups of students who experienced
lecture presentations delivered so as to be low in enthusiasm in the
second study are low on the first LDF as they should be and are
somewhat differentiated on the second LDF as they should be.



Discussion 



The first issue addressed in the current study concerned the
validity of student ratings of faculty with respect to differences in
faculty presentation manner during lectures. This faculty
characteristic has been discussed under a variety of names, including
"charisma," "personality," "dynamism," "expressiveness," and
"enthusiasm." The use of discriminant analysis to develop student
rating scales that are valid with respect to faculty enthusiasm is
supported by the current study findings. A stringent cross-validation
of an enthusiasm rating scale was successful. It should also be noted
that a single rating scale item that correlates highest with this scale
is a valid measure of faculty enthusiasm. In other words, many items
and complicated weights are not necessary if the correct items are used.

The second issue concerned the validity of student ratings with
respect to differences in faculty information-giving during lecture
presentations. It is generally accepted that one goal of lecture
presentations is the dissemination of information. This may not be the
only goal but most agree that information-giving is a goal. We had
hoped that the use of discriminant analysis and data gathered during
controlled experiments (as a means for selecting and weighting items)
would improve the sensitivity of student ratings to actual differences
in information-giving. In order to be generally useful, such
improvements would have to be generalizable, i.e., valid across student
groups. Unfortunately, scoring methods that increased the validity of
student ratings with respect to differences in faculty
information-giving were not valid when used with independent groups of
students who experienced the same lecturers.

On the basis of previous Doctor Fox studies it has been established
that: a) student ratings are not sensitive to amount of information
covered in lectures even though student achievement is affected directly
(Ware and Williams, 1975); b) sensitizing students to the content of
lectures by adding an incentive for learning from the lectures does not
improve student accuracy in rating content coverage (Williams and Ware,
1976a); c) accuracy of ratings of content coverage does not improve when
students are exposed to a second lecture under the same conditions
(Williams and Ware, 1976b). In other words, providing students with
an additional exposure to the lecturer does not enhance the students'
ability to see through the Doctor Fox effect. The present study
indicates that empirically based selection and weighting of items does
not improve the validity of student rating scores in detecting real
differences in information-giving.

However, this study and others provide some leads. In the current
study (data not reported), the information-giving rating score used
alone was less accurate in detecting actual differences in
information-giving than when used in conjunction with the enthusiasm
rating scale. Trends in this study and previous "Doctor Fox" studies
indicate that differences in faculty information-giving are easier to
detect using rating scores when faculty members are less enthusiastic
in their lecture presentations. Perhaps a valid information-giving
rating procedure can be developed that uses different items and
different weights depending on the degree of enthusiasm detected
through use of an enthusiasm rating scale.

Improvements in the content of items used in student-faculty
rating instruments may also further improve the validity of rating
scores. A replication of the current study with additional items
pertaining to content and clarity of subject matter indicates promise
for improving the validity of rating scores in relation to differences
in information-giving.

Finally, it may be possible to provide simple instructions which
will enable students to more accurately rate instructor
information-giving. This does not appear to have been tried with
respect to student-faculty ratings of instruction but a study by Browne
and Anderson (1975) suggests that the idea is worth investigation.

Current practices in the evaluation of teaching effectiveness are
limited almost entirely to proxy measures, namely student ratings.
Direct observation of faculty by fellow specialists and direct measures
of student achievement are almost never used in the evaluation of
teaching effectiveness. Students are valid observers of how
enthusiastic faculty members are but are not valid observers of two
other important aspects of teaching effectiveness, namely, a)
differences in lecturer information-giving and b) differences in their
own achievement as a result of the lecture.

Until student rating scales are constructed so as to be valid 
with respect to differences in faculty information-giving, we suggest 
that the best (if not the only) way to evaluate such differences in 
faculty is direct observation by trained evaluators and the best (if 
not only) way to evaluate student achievement is an achievement test in 
conjunction with proper controls. The "state of the art" is that stu- 
dent ratings of faculty are of little or no use with respect to differ- 
ences in faculty information-giving and student achievement. 
Figure 1. Student-Faculty Rating Items

The Lecturer:

 1. Spoke understandably.
 2. Knew if students understood him.
 3. Showed an interest in students.
 4. Increased your appreciation for the subject.
 5. In general, taught effectively.
 6. Gave several examples to explain complex ideas.
 7. Knew his subject matter.
 8. Stressed important material.
 9. Was an effective lecturer.
10. Has a good sense of humor.
11. Organized and presented subject matter well.
12. Inspired confidence in his knowledge of the subject.
13. Broadened my interest in the subject.
14. Explained the subject clearly.
15. Increased my knowledge of the subject.
16. Stimulated my thinking.
17. Was enthusiastic about the subject.
18. Made learning enjoyable.

Items 1-7 were scored using a five-choice response continuum ranging
from "Exceptional Performance" to "Improvement Definitely Needed."
Items 8-18 were scored from responses to a five-choice response continuum
ranging from "I strongly agree with the statement" to "I strongly disagree
with the statement."
Table 1

Standardized Weights for Student-Faculty Rating Variables

Item                                   Standardized Weights (b)
No.   Content (a)                      LDF I       LDF II

 1.   Spoke understandably              .040        .336
 6.   Gave examples to explain          .076       -.295
 8.   Stressed important material       .061        .151
11.   Organized/presented well         -.141       -.018
13.   Broadened my interest            -.147        .222
17.   Was enthusiastic                  .533       -.147

(a) Abbreviated item content (see Figure 1 for verbatim items).
(b) Derived in the first study.